knitr::opts_chunk$set( echo = FALSE, warning = FALSE, message = FALSE)

INTRODUCTION

In this assignment, sales of several products sold on the e-commerce platform ‘Trendyol’ will be forecast. The sold count of each product will be examined and the data will be decomposed. Then several forecasting strategies will be developed, and the best among them will be picked according to their weighted mean absolute percentage errors. Data up to 23 June 2021 will serve as the training set for the models to learn from, and data from 24 June to 30 June 2021 as the test set. The 9 products to be examined are:

  • 85004 - La Roche Posay Face Cleanser
  • 4066298 - Sleepy Baby Wipes
  • 6676673 - Xiaomi Bluetooth Headphones
  • 7061886 - Fakir Vacuum Cleaner
  • 31515569 - TrendyolMilla Tights
  • 32737302 - TrendyolMilla Bikini Top
  • 32939029 - Oral-B Rechargeable Toothbrush
  • 48740784 - Altınyıldız Classics Jacket
  • 73318567 - TrendyolMilla Bikini Top
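As a small illustration of the train/test split described above, a sketch on a toy data frame with hypothetical column names (`event_date`, `sold_count`) rather than the actual Trendyol data:

```r
# Hypothetical example: split daily sales at 23 June 2021.
sales <- data.frame(
  event_date = seq(as.Date("2021-06-20"), as.Date("2021-06-30"), by = "day"),
  sold_count = c(10, 12, 9, 11, 14, 13, 15, 16, 12, 11, 10)
)
train <- subset(sales, event_date <= as.Date("2021-06-23"))  # model fitting
test  <- subset(sales, event_date >= as.Date("2021-06-24"))  # 24-30 June
nrow(train)
nrow(test)   # 7 test days, as in the assignment
```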

PRODUCT 1 - La Roche Posay Face Cleanser

Before building alternative models, the plot of the data should be inspected and its seasonality and trend examined. First of all, it is clear that the variance of the data is very large, so from now on we continue with the logarithm of the sold count. Below, you can see the actual plot and the plot of the log-transformed values. There is a slightly increasing trend, especially in the middle of the plot. No significant seasonality is visible. To look further, there is a plot of three months of 2021 (March, April and May). Again, the seasonality is not very pronounced, but the data is higher at the beginning of each month and decreases toward its end, so a monthly seasonality can be argued.

Before decomposing, to make a better decision, the autocorrelation plot of the data can be examined. Below, it can be seen that there is a spike at lag 63. Since no shorter seasonality is significant, the frequency is set to 63.
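This frequency choice can be sketched as follows on a synthetic log series; `acf()` returns the autocorrelations, and the lag with the largest value (excluding lag 0) is taken as the trial frequency:

```r
# Sketch: stabilize variance with log, then pick the trial frequency
# as the lag with the largest autocorrelation beyond lag 0.
set.seed(1)
sold    <- exp(rnorm(200, mean = 3, sd = 0.5))  # stand-in for sold_count
logsold <- log(sold)
a <- acf(logsold, lag.max = 70, plot = FALSE)
best_lag <- which.max(a$acf[-1])                # drop lag 0 before searching
best_lag
```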

Now the data will be decomposed in order to obtain the random series. The “additive” type of decomposition will be used because the variance of the data is low after switching to the logarithm. Plots of the deseasonalized and random series, together with their autocorrelations, can be seen below.
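A minimal sketch of this step on synthetic data (the real series is not reproduced here); `decompose()` with `type = "additive"` returns the trend, seasonal and random components:

```r
# Additive decomposition at the chosen frequency (63 here), on a
# synthetic log series with a period-63 seasonal pattern.
set.seed(2)
logsold <- sin(2 * pi * (1:252) / 63) + rnorm(252, 0, 0.2) + 3
ts63 <- ts(logsold, frequency = 63)
dec  <- decompose(ts63, type = "additive")
random_part <- dec$random            # random series used for ARIMA fitting
deseason    <- ts63 - dec$seasonal   # deseasonalized series
```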

For the ARIMA model, (p,d,q) values should be chosen. For this purpose, the ACF and PACF plots can be inspected. Looking at the ACF, 1 can be chosen for ‘q’; looking at the PACF, 1 or 4 can be chosen for ‘p’.

The AIC and BIC values of the suggested models can be seen below. The auto.arima function is used as well. Smaller AIC and BIC values mean a better model, so, comparing them, the (1,0,0) model that auto.arima suggested is the best among them, and we proceed with it.

## 
## Call:
## arima(x = detrend, order = c(1, 0, 1))
## 
## Coefficients:
##          ar1     ma1  intercept
##       0.6310  0.0965     0.0000
## s.e.  0.0598  0.0745     0.0579
## 
## sigma^2 estimated as 0.1279:  log likelihood = -130.47,  aic = 268.94
## [1] 268.9444
## [1] 284.177
## 
## Call:
## arima(x = detrend, order = c(4, 0, 1))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1  intercept
##       0.7443  -0.0417  0.0122  -0.1188  -0.0239    -0.0001
## s.e.  0.3478   0.2596  0.0682   0.0610   0.3486     0.0468
## 
## sigma^2 estimated as 0.1252:  log likelihood = -126.94,  aic = 267.88
## [1] 267.8753
## [1] 294.5323
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)            with non-zero mean : 268.7823
##  ARIMA(0,0,0)            with non-zero mean : 475.4002
##  ARIMA(1,0,0)            with non-zero mean : 268.5768
##  ARIMA(0,0,1)            with non-zero mean : 325.494
##  ARIMA(0,0,0)            with zero mean     : 473.3897
##  ARIMA(2,0,0)            with non-zero mean : 268.8005
##  ARIMA(1,0,1)            with non-zero mean : 269.0137
##  ARIMA(2,0,1)            with non-zero mean : 267.8519
##  ARIMA(3,0,1)            with non-zero mean : Inf
##  ARIMA(1,0,2)            with non-zero mean : 270.2874
##  ARIMA(3,0,0)            with non-zero mean : 269.9104
##  ARIMA(3,0,2)            with non-zero mean : Inf
##  ARIMA(2,0,1)            with zero mean     : 265.8127
##  ARIMA(1,0,1)            with zero mean     : 266.9701
##  ARIMA(2,0,0)            with zero mean     : 266.7685
##  ARIMA(3,0,1)            with zero mean     : Inf
##  ARIMA(2,0,2)            with zero mean     : 266.7251
##  ARIMA(1,0,0)            with zero mean     : 266.5464
##  ARIMA(1,0,2)            with zero mean     : 268.23
##  ARIMA(3,0,0)            with zero mean     : 267.8628
##  ARIMA(3,0,2)            with zero mean     : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(2,0,1)            with zero mean     : Inf
##  ARIMA(1,0,0)            with zero mean     : 266.6784
## 
##  Best model: ARIMA(1,0,0)            with zero mean
## Series: detrend 
## ARIMA(1,0,0) with zero mean 
## 
## Coefficients:
##          ar1
##       0.6819
## s.e.  0.0399
## 
## sigma^2 estimated as 0.129:  log likelihood=-131.32
## AIC=266.64   AICc=266.68   BIC=274.26
## [1] 266.642
## [1] 274.2583
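The comparison above can be sketched as follows on a synthetic AR(1) series standing in for the random component; each candidate order is fitted with `arima()` and its AIC/BIC tabulated, lower being better:

```r
# Sketch of the candidate comparison on simulated data.
set.seed(3)
detrend <- arima.sim(list(ar = 0.7), n = 300)   # stand-in for the random series
orders  <- list(c(1, 0, 1), c(4, 0, 1), c(1, 0, 0))
for (ord in orders) {
  fit <- arima(detrend, order = ord)
  cat(sprintf("(%d,%d,%d)  AIC = %.2f  BIC = %.2f\n",
              ord[1], ord[2], ord[3], AIC(fit), BIC(fit)))
}
```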

Below, the fitted values of the final model are compared with the training set to see how well the model has learned the data. There are plots of the random series, the logarithm of the actual series, and the actual series. The brown lines belong to the fitted model and the light blue lines to the actual series.

The model can be improved by adding regressors. To decide which regressors to add, the correlation matrix of the different attributes should be examined. Looking at the correlations with the logarithm of sold count, the category_favored and basket_count attributes are chosen.

The chosen regressors are added to the model. The new model’s AIC and BIC values are much lower, so the model is better and we proceed with it. The fitted values for the training set are also better, as can be seen in the plots: these fits are closer to the data than those of the previously chosen model.

## 
## Call:
## arima(x = detrend, order = c(1, 0, 0), xreg = xreg)
## 
## Coefficients:
##          ar1  intercept   xreg1  xreg2
##       0.7532    -0.6575  0.0015      0
## s.e.  0.0383     0.0254  0.0002    NaN
## 
## sigma^2 estimated as 0.07767:  log likelihood = -47.46,  aic = 104.92
## [1] 104.9199
## [1] 123.9606
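A sketch of this regressor step on simulated data (the attribute names follow the report, the values are synthetic): correlations with the log series are inspected, and the chosen columns are passed to `arima()` through `xreg`:

```r
# Hypothetical example: select regressors by correlation, then refit.
set.seed(4)
n <- 200
basket_count     <- rpois(n, 50)
category_favored <- rpois(n, 30)
logsold <- 0.01 * basket_count + 0.02 * category_favored + rnorm(n, 0, 0.3)
cors <- cor(cbind(logsold, basket_count, category_favored))
round(cors["logsold", ], 2)                     # inspect correlations
xreg <- cbind(basket_count, category_favored)
fit  <- arima(logsold, order = c(1, 0, 0), xreg = xreg)
AIC(fit)
```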

Predictions are made with the final model. The predicted values alongside the test set can be seen below. The mean absolute error for each day is plotted, and the weighted mean absolute percentage error of the prediction is reported as well.

## [1] " Weighted Mean Absolute Percentage Error :  33.0179407746578"

PRODUCT 2 - Sleepy Baby Wipes

At first, the plot of the data should be inspected and its seasonality and trend examined. It is clear that the variance of the data is very large, so from now on we continue with the logarithm of the sold count. Below, you can see the actual plot and the plot of the log-transformed values. There is a slightly increasing trend, especially at the end of the plot. No significant seasonality is visible. To look further, there is a plot of three months of 2021 (March, April and May). Again, the seasonality is not pronounced, though there is a spike at the beginning of each month. In May there is a large rise, probably due to Covid-19 conditions. In conclusion, a monthly seasonality can be argued, but it is not very clear.

Before decomposing, to make a better decision, the autocorrelation plot of the data can be examined. Below, it can be seen that there is a spike at lag 27. Since no shorter seasonality is significant, the frequency is set to 27.

Now the data will be decomposed in order to obtain the random series. The “additive” type of decomposition will be used because the variance of the data is low after switching to the logarithm. Plots of the deseasonalized and random series, together with their autocorrelations, can be seen below.

For the ARIMA model, (p,d,q) values should be chosen by inspecting the ACF and PACF plots. Looking at the ACF, 2 or 10 can be chosen for ‘q’; looking at the PACF, 2 can be chosen for ‘p’.

The AIC and BIC values of the suggested models can be seen below. The auto.arima function is used as well. Looking at the AIC and BIC values, the (2,0,2) model suggested above is the best among them, and we proceed with it.

## 
## Call:
## arima(x = detrend2, order = c(2, 0, 2))
## 
## Coefficients:
##          ar1      ar2      ma1      ma2  intercept
##       1.6173  -0.7246  -0.7336  -0.2664     0.0002
## s.e.  0.0441   0.0436   0.0670   0.0667     0.0022
## 
## sigma^2 estimated as 0.1442:  log likelihood = -169.07,  aic = 350.13
## [1] 350.1308
## [1] 373.5956
## 
## Call:
## arima(x = detrend2, order = c(2, 0, 10))
## 
## Coefficients:
##          ar1      ar2      ma1      ma2     ma3      ma4     ma5     ma6
##       1.5860  -0.8114  -0.7230  -0.1593  0.1764  -0.0368  0.0068  0.0767
## s.e.  0.0961   0.0941   0.1106   0.0677  0.0759   0.0678  0.0691  0.0720
##          ma7      ma8      ma9     ma10  intercept
##       0.0054  -0.1594  -0.0731  -0.1137     0.0003
## s.e.  0.0634   0.0626   0.0650   0.0726     0.0025
## 
## sigma^2 estimated as 0.1392:  log likelihood = -162.61,  aic = 353.23
## [1] 353.2274
## [1] 407.9785
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)            with non-zero mean : Inf
##  ARIMA(0,0,0)            with non-zero mean : 760.1106
##  ARIMA(1,0,0)            with non-zero mean : 443.9374
##  ARIMA(0,0,1)            with non-zero mean : 495.5242
##  ARIMA(0,0,0)            with zero mean     : 758.0933
##  ARIMA(2,0,0)            with non-zero mean : 397.8918
##  ARIMA(3,0,0)            with non-zero mean : 400.2801
##  ARIMA(2,0,1)            with non-zero mean : 397.8625
##  ARIMA(1,0,1)            with non-zero mean : 403.3121
##  ARIMA(3,0,1)            with non-zero mean : 401.3931
##  ARIMA(1,0,2)            with non-zero mean : 403.9694
##  ARIMA(3,0,2)            with non-zero mean : Inf
##  ARIMA(2,0,1)            with zero mean     : 395.8077
##  ARIMA(1,0,1)            with zero mean     : 401.2683
##  ARIMA(2,0,0)            with zero mean     : 395.8478
##  ARIMA(3,0,1)            with zero mean     : 399.3263
##  ARIMA(2,0,2)            with zero mean     : Inf
##  ARIMA(1,0,0)            with zero mean     : 441.9045
##  ARIMA(1,0,2)            with zero mean     : 401.9144
##  ARIMA(3,0,0)            with zero mean     : 398.2249
##  ARIMA(3,0,2)            with zero mean     : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(2,0,1)            with zero mean     : 395.0582
## 
##  Best model: ARIMA(2,0,1)            with zero mean
## Series: detrend2 
## ARIMA(2,0,1) with zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1
##       1.3194  -0.5680  -0.3469
## s.e.  0.2052   0.1465   0.2563
## 
## sigma^2 estimated as 0.1679:  log likelihood=-193.47
## AIC=394.95   AICc=395.06   BIC=410.59
## [1] 394.9483
## [1] 410.5915

Below, the fitted values of the final model are compared with the training set to see how well the model has learned the data. There are plots of the random series, the logarithm of the actual series, and the actual series. The brown lines belong to the fitted model and the light blue lines to the actual series.

The model can be improved by adding regressors. To decide which regressors to add, the correlation matrix of the different attributes should be examined. Looking at the correlations with the logarithm of sold count, the category_sold and basket_count attributes are chosen.

The chosen regressors are added to the model. The new model’s AIC and BIC values are much lower, so the model is better and we proceed with it. The fitted values for the training set are also better, as can be seen in the plots: these fits are closer to the data than those of the previously chosen model.

## 
## Call:
## arima(x = detrend2, order = c(2, 0, 2), xreg = xreg2)
## 
## Coefficients:
##          ar1     ar2     ma1     ma2  intercept  xreg21  xreg22
##       0.0493  0.4027  0.7695  0.1611    -0.5382   3e-04   1e-04
## s.e.  0.2893  0.2104  0.2899  0.0723     0.0077   1e-04   0e+00
## 
## sigma^2 estimated as 0.0749:  log likelihood = -45.87,  aic = 107.74
## [1] 107.7449
## [1] 139.0313

Predictions are made with the final model. The predicted values alongside the test set can be seen below. The mean absolute error for each day is plotted, and the weighted mean absolute percentage error of the prediction is reported as well.

## [1] " Weighted Mean Absolute Percentage Error :  26.774801709638"

PRODUCT 3 - Xiaomi Bluetooth Headphones

First of all, it is clear that the variance of the data is very large, so from now on we continue with the logarithm of the sold count. Below, you can see the actual plot and the plot of the log-transformed values. Looking at the plots of this product, the line graph shows that the sales have high variance, with peaks on some dates, and there may be a cyclical behaviour, which is an indicator of seasonality.

To proceed further, the data should be decomposed, so a frequency value must be chosen. Frequencies of 30 and 7 days can be selected and the data decomposed accordingly. In addition, the ACF plot of the data can be examined, and the lag with high autocorrelation can be chosen as another trial frequency. Since the variance does not seem to be increasing, the additive type of decomposition is used. Below, the random series can be seen.

The above decomposition series belong to time series with 7 and 30 days frequency, respectively.

Looking at the ACF plot of the series, highest ACF value belongs to lag 32, so time series decomposition with 32 day frequency would be sufficient.

In time series decomposition, the random part is assumed to be randomly distributed with mean zero and standard deviation 1; to decide on the best frequency, the random part of each decomposed series should be inspected. In this case, the random part of the decomposition with 7-day frequency is closest to such a series, so it is chosen as the final decomposition.
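This comparison can be sketched as follows: decompose the same synthetic series at each candidate frequency and inspect the mean and standard deviation of the random part:

```r
# Sketch of the frequency comparison on a synthetic weekly-seasonal series.
set.seed(5)
x <- sin(2 * pi * (1:210) / 7) + rnorm(210)
for (f in c(7, 30)) {
  r <- decompose(ts(x, frequency = f))$random
  cat(sprintf("freq %2d: mean = %6.3f  sd = %.3f\n",
              f, mean(r, na.rm = TRUE), sd(r, na.rm = TRUE)))
}
```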

After the decomposition, (p,d,q) values should be chosen for the model by examining the ACF and PACF plots: peaks in the ACF suggest ‘q’ values and peaks in the PACF suggest ‘p’ values. Looking at the ACF, 3 or 4 may be selected for ‘q’; looking at the PACF, 3 may be selected for ‘p’. The auto.arima function is used as well. The AIC and BIC values of the suggested models can be seen below; smaller values mean a better model. Looking at them, the (3,0,3) model built from the ACF and PACF observations is the best among them.

## 
## Call:
## arima(x = detrend, order = c(1, 0, 1))
## 
## Coefficients:
##           ar1     ma1  intercept
##       -0.0087  0.3157     0.0007
## s.e.   0.1050  0.0917     0.0164
## 
## sigma^2 estimated as 0.06159:  log likelihood = -9.89,  aic = 27.78
## [1] 27.78484
## [1] 43.63916
## 
## Call:
## arima(x = detrend, order = c(3, 0, 3))
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2      ma3  intercept
##       0.1673  0.3160  -0.4474  -0.1977  -0.7620  -0.0403     -1e-04
## s.e.  0.1826  0.2483   0.1531   0.1866   0.2702   0.1503      2e-04
## 
## sigma^2 estimated as 0.04082:  log likelihood = 67.22,  aic = -118.44
## [1] -118.4387
## [1] -86.73004
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)           with non-zero mean : -107.9538
##  ARIMA(0,0,0)           with non-zero mean : 58.78767
##  ARIMA(1,0,0)           with non-zero mean : 36.60636
##  ARIMA(0,0,1)           with non-zero mean : 25.80072
##  ARIMA(0,0,0)           with zero mean     : 56.76901
##  ARIMA(1,0,2)           with non-zero mean : Inf
##  ARIMA(2,0,1)           with non-zero mean : -109.9478
##  ARIMA(1,0,1)           with non-zero mean : 28.34829
##  ARIMA(2,0,0)           with non-zero mean : 4.567387
##  ARIMA(3,0,1)           with non-zero mean : Inf
##  ARIMA(3,0,0)           with non-zero mean : -33.54895
##  ARIMA(3,0,2)           with non-zero mean : Inf
##  ARIMA(2,0,1)           with zero mean     : -111.095
##  ARIMA(1,0,1)           with zero mean     : 26.31239
##  ARIMA(2,0,0)           with zero mean     : 2.52807
##  ARIMA(3,0,1)           with zero mean     : Inf
##  ARIMA(2,0,2)           with zero mean     : -109.1481
##  ARIMA(1,0,0)           with zero mean     : 34.58137
##  ARIMA(1,0,2)           with zero mean     : -68.55153
##  ARIMA(3,0,0)           with zero mean     : -35.59868
##  ARIMA(3,0,2)           with zero mean     : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(2,0,1)           with zero mean     : Inf
##  ARIMA(2,0,1)           with non-zero mean : Inf
##  ARIMA(2,0,2)           with zero mean     : Inf
##  ARIMA(2,0,2)           with non-zero mean : Inf
##  ARIMA(1,0,2)           with zero mean     : Inf
##  ARIMA(3,0,0)           with zero mean     : -37.5118
## 
##  Best model: ARIMA(3,0,0)           with zero mean
## Series: detrend 
## ARIMA(3,0,0) with zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3
##       0.2286  -0.1907  -0.315
## s.e.  0.0481   0.0485   0.048
## 
## sigma^2 estimated as 0.0524:  log likelihood=22.81
## AIC=-37.62   AICc=-37.51   BIC=-21.76
## [1] -37.61596
## [1] -21.76165

Below, the fitted values of the final model are compared with the training set to see how well the model has learned the data. There are plots of the random series, the logarithm of the actual series, and the actual series. The brown lines belong to the fitted model and the light blue lines to the actual series.

The fitted values have captured part of the behaviour of the series, but they overshoot at the peaks of the original data. The model can be improved by adding regressors. To decide which regressors to add, the correlation matrix of the different attributes should be examined. Looking at the correlations with the logarithm of sold count, the category_sold and visit_count attributes are chosen.

## 
## Call:
## arima(x = detrend, order = c(3, 0, 3), xreg = xreg)
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2      ma3  intercept  xreg1
##       0.1639  0.3118  -0.4518  -0.2009  -0.7632  -0.0359    -0.0267      0
## s.e.  0.1817  0.2405   0.1263   0.1865   0.2598   0.1219        NaN    NaN
##       xreg2
##           0
## s.e.    NaN
## 
## sigma^2 estimated as 0.04005:  log likelihood = 70.91,  aic = -121.82
## [1] -121.8218
## [1] -82.18603

The chosen regressors are added to the model. The new model’s AIC and BIC values are much lower, so the model is better and we proceed with it.

Predictions are made with the final model. The predicted values alongside the test set can be seen below. The mean absolute error for each day is plotted, and the weighted mean absolute percentage error of the prediction is reported as well.

## Time Series:
## Start = c(1, 1) 
## End = c(1, 7) 
## Frequency = 7 
## [1] 560.3584 522.0930 483.4829 465.1576 526.3092 548.6215 554.8162

## [1] " Weighted Mean Absolute Percentage Error :  74.2426820017771"

Looking at the Predictions vs Actual Sales plot, the model captured the behaviour of the data, but there is an almost constant gap between the predictions and the actual sales, which is probably why the WMAPE is high. Further investigation could be made to resolve this.

PRODUCT 4 - Fakir Vacuum Cleaner

First of all, it is clear that the variance of the data is very large, so from now on we continue with the logarithm of the sold count. Below, you can see the actual plot and the plot of the log-transformed values. There is no significant trend. There may be seasonality; to look further, there is a plot of three months of 2021 (March, April and May). The seasonality is not easily observed, though there is a spike at the end of each month. In May there is a large rise, probably due to Covid-19 conditions. In conclusion, a monthly seasonality can be argued, but it is not very clear.

Frequencies of 30 and 7 days can be selected and the data decomposed accordingly. In addition, the ACF plot of the data can be examined, and the lag with high autocorrelation can be chosen as another trial frequency. Since the variance does not seem to be increasing, the additive type of decomposition is used. Below, the random series can be seen.

The above decomposition series belong to time series with 7 and 30 days frequency, respectively.

Looking at the ACF plot of the series, highest ACF value belongs to lag 35, so time series decomposition with 35 day frequency would be sufficient.

In this case, the random part of the decomposed time series with 35-day frequency is closest to a randomly distributed series with mean zero and standard deviation 1, so it is chosen as the final decomposition.

Looking at the ACF, 1 or 3 may be selected for ‘q’; looking at the PACF, 1 may be selected for ‘p’. The auto.arima function is used as well. The AIC and BIC values of the suggested models can be seen below. Looking at them, the (1,0,3) model built from the ACF and PACF observations is the best among them, and it is also the model that auto.arima suggested.

## 
## Call:
## arima(x = detrend, order = c(1, 0, 1))
## 
## Coefficients:
##          ar1      ma1  intercept
##       0.6785  -0.0670     0.0078
## s.e.  0.0548   0.0696     0.0539
## 
## sigma^2 estimated as 0.1261:  log likelihood = -138.75,  aic = 285.51
## [1] 285.5067
## [1] 301.0623
## 
## Call:
## arima(x = detrend, order = c(1, 0, 3))
## 
## Coefficients:
##          ar1     ma1     ma2     ma3  intercept
##       0.3775  0.1972  0.2890  0.2218     0.0078
## s.e.  0.1158  0.1105  0.0751  0.0663     0.0502
## 
## sigma^2 estimated as 0.1216:  log likelihood = -132.26,  aic = 276.52
## [1] 276.5232
## [1] 299.8564
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)            with non-zero mean : 270.3389
##  ARIMA(0,0,0)            with non-zero mean : 472.5668
##  ARIMA(1,0,0)            with non-zero mean : 284.9426
##  ARIMA(0,0,1)            with non-zero mean : 348.2958
##  ARIMA(0,0,0)            with zero mean     : 470.6423
##  ARIMA(1,0,2)            with non-zero mean : 284.6
##  ARIMA(2,0,1)            with non-zero mean : 288.3505
##  ARIMA(3,0,2)            with non-zero mean : 277.7656
##  ARIMA(2,0,3)            with non-zero mean : 278.0155
##  ARIMA(1,0,1)            with non-zero mean : 286.0879
##  ARIMA(1,0,3)            with non-zero mean : 277.0941
##  ARIMA(3,0,1)            with non-zero mean : 273.6461
##  ARIMA(3,0,3)            with non-zero mean : 278.5149
##  ARIMA(2,0,2)            with zero mean     : 268.5336
##  ARIMA(1,0,2)            with zero mean     : 282.5616
##  ARIMA(2,0,1)            with zero mean     : 286.312
##  ARIMA(3,0,2)            with zero mean     : 275.702
##  ARIMA(2,0,3)            with zero mean     : 275.9444
##  ARIMA(1,0,1)            with zero mean     : 284.0618
##  ARIMA(1,0,3)            with zero mean     : 275.0447
##  ARIMA(3,0,1)            with zero mean     : 271.7115
##  ARIMA(3,0,3)            with zero mean     : 276.4253
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(2,0,2)            with zero mean     : Inf
##  ARIMA(2,0,2)            with non-zero mean : Inf
##  ARIMA(3,0,1)            with zero mean     : Inf
##  ARIMA(3,0,1)            with non-zero mean : Inf
##  ARIMA(1,0,3)            with zero mean     : 274.7166
## 
##  Best model: ARIMA(1,0,3)            with zero mean
## Series: detrend 
## ARIMA(1,0,3) with zero mean 
## 
## Coefficients:
##          ar1     ma1     ma2     ma3
##       0.3771  0.1976  0.2893  0.2221
## s.e.  0.1159  0.1105  0.0751  0.0662
## 
## sigma^2 estimated as 0.123:  log likelihood=-132.27
## AIC=274.55   AICc=274.72   BIC=293.99
## [1] 274.5476
## [1] 293.9919

Below, the fitted values of the final model are compared with the training set to see how well the model has learned the data. There are plots of the random series, the logarithm of the actual series, and the actual series. The brown lines belong to the fitted model and the light blue lines to the actual series.

The fitted values have captured part of the behaviour of the series, but they overshoot at the peaks of the original data. The model can be improved by adding regressors. To decide which regressors to add, the correlation matrix of the different attributes should be examined. Looking at the correlations with the logarithm of sold count, the category_favored and price attributes are chosen.

The chosen regressors are added to the model. The new model’s AIC and BIC values are much lower, so the model is better and we proceed with it.

## 
## Call:
## arima(x = detrend, order = c(1, 0, 3), xreg = xreg)
## 
## Coefficients:
##          ar1     ma1     ma2     ma3  intercept  xreg1    xreg2
##       0.3379  0.0382  0.2030  0.2168     1.9104  1e-04  -0.0088
## s.e.  0.1498  0.1440  0.0704  0.0659     0.3527    NaN   0.0012
## 
## sigma^2 estimated as 0.09467:  log likelihood = -86.96,  aic = 189.92
## [1] 189.9197
## [1] 221.0307

Predictions are made with the final model. The predicted values alongside the test set can be seen below. The mean absolute error for each day is plotted, and the weighted mean absolute percentage error of the prediction is reported as well.

## Time Series:
## Start = c(1, 1) 
## End = c(1, 7) 
## Frequency = 7 
## [1] 13.01960 11.29562 12.22475 12.54396 13.87283 15.89303 17.46858

## [1] " Weighted Mean Absolute Percentage Error :  40.265400020764"

Looking at the Predictions vs Actual Sales plot, the model captured the behaviour of the data only partly, and the fit is not the best. The plot of mean absolute errors above also shows a peak on June 25, which may correspond to an outlier in the data and may be why the WMAPE is high. Further investigation could be made to resolve this.

PRODUCT 5 - TrendyolMilla Tights

First of all, it is clear that the variance of the data is very large, so from now on we continue with the logarithm of the sold count. Below, you can see the actual plot and the plot of the log-transformed values. There is a decreasing trend. There may be seasonality; to look further, there is a plot of three months of 2021 (March, April and May). The seasonality is not easily observed, though there is a spike in the middle of each month. In May there is a large rise, probably due to Covid-19 conditions. In conclusion, a monthly seasonality can be argued, but it is not very clear. Frequencies of 30 and 7 days can be selected and the data decomposed accordingly. In addition, the ACF plot of the data can be examined, and the lag with high autocorrelation can be chosen as another trial frequency. Since the variance does not seem to be increasing, the additive type of decomposition is used. Below, the random series can be seen.

The above decomposition series belong to time series with 7 and 30 days frequency, respectively.

Looking at the ACF plot of the series, highest ACF value belongs to lag 62, so time series decomposition with 62 day frequency would be sufficient.

In this case, the random part of the decomposed time series with 7-day frequency is closest to a randomly distributed series with mean zero and standard deviation 1, so it is chosen as the final decomposition.

Looking at the ACF, 1, 3 or 5 may be selected for ‘q’; looking at the PACF, 2 may be selected for ‘p’. The auto.arima function is used as well. The AIC and BIC values of the suggested models can be seen below. Looking at them, the (2,0,5) model built from the ACF and PACF observations is the best among them, and it is better than the model that auto.arima suggested: the ARIMA(2,0,5)’s AIC is smaller than that of the ARIMA(1,0,1) suggested by auto.arima.

## 
## Call:
## arima(x = detrend, order = c(2, 0, 1))
## 
## Coefficients:
##          ar1      ar2      ma1  intercept
##       1.1647  -0.6177  -1.0000      1e-04
## s.e.  0.0400   0.0403   0.0066      3e-04
## 
## sigma^2 estimated as 0.07068:  log likelihood = -39.67,  aic = 89.33
## [1] 89.33336
## [1] 109.1513
## 
## Call:
## arima(x = detrend, order = c(2, 0, 3))
## 
## Coefficients:
##          ar1      ar2      ma1      ma2     ma3  intercept
##       1.4805  -0.6927  -1.4071  -0.0561  0.4632      1e-04
## s.e.  0.0410   0.0400   0.0488   0.0822  0.0450      1e-04
## 
## sigma^2 estimated as 0.06498:  log likelihood = -24.9,  aic = 63.8
## [1] 63.79951
## [1] 91.54457
## 
## Call:
## arima(x = detrend, order = c(2, 0, 5))
## 
## Coefficients:
##          ar1      ar2      ma1      ma2     ma3     ma4      ma5  intercept
##       1.4236  -0.6318  -1.2980  -0.1782  0.3118  0.2256  -0.0611      1e-04
## s.e.  0.0986   0.0813   0.1137   0.1112  0.1157  0.0907   0.0839      1e-04
## 
## sigma^2 estimated as 0.06376:  log likelihood = -21.24,  aic = 60.48
## [1] 60.48229
## [1] 96.1545
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)           with non-zero mean : 103.7667
##  ARIMA(0,0,0)           with non-zero mean : 319.1792
##  ARIMA(1,0,0)           with non-zero mean : 256.6807
##  ARIMA(0,0,1)           with non-zero mean : 225.066
##  ARIMA(0,0,0)           with zero mean     : 317.1717
##  ARIMA(1,0,2)           with non-zero mean : 212.6558
##  ARIMA(2,0,1)           with non-zero mean : 123.8577
##  ARIMA(3,0,2)           with non-zero mean : Inf
##  ARIMA(2,0,3)           with non-zero mean : Inf
##  ARIMA(1,0,1)           with non-zero mean : 217.1248
##  ARIMA(1,0,3)           with non-zero mean : 144.7481
##  ARIMA(3,0,1)           with non-zero mean : Inf
##  ARIMA(3,0,3)           with non-zero mean : Inf
##  ARIMA(2,0,2)           with zero mean     : 102.2443
##  ARIMA(1,0,2)           with zero mean     : 210.6276
##  ARIMA(2,0,1)           with zero mean     : 122.168
##  ARIMA(3,0,2)           with zero mean     : Inf
##  ARIMA(2,0,3)           with zero mean     : Inf
##  ARIMA(1,0,1)           with zero mean     : 215.1369
##  ARIMA(1,0,3)           with zero mean     : 143.1001
##  ARIMA(3,0,1)           with zero mean     : Inf
##  ARIMA(3,0,3)           with zero mean     : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(2,0,2)           with zero mean     : Inf
##  ARIMA(2,0,2)           with non-zero mean : Inf
##  ARIMA(2,0,1)           with zero mean     : Inf
##  ARIMA(2,0,1)           with non-zero mean : Inf
##  ARIMA(1,0,3)           with zero mean     : Inf
##  ARIMA(1,0,3)           with non-zero mean : Inf
##  ARIMA(1,0,2)           with zero mean     : Inf
##  ARIMA(1,0,2)           with non-zero mean : Inf
##  ARIMA(1,0,1)           with zero mean     : 221.9019
## 
##  Best model: ARIMA(1,0,1)           with zero mean
## Series: detrend 
## ARIMA(1,0,1) with zero mean 
## 
## Coefficients:
##          ar1     ma1
##       0.0506  0.4942
## s.e.  0.0794  0.0627
## 
## sigma^2 estimated as 0.1024:  log likelihood=-107.92
## AIC=221.84   AICc=221.9   BIC=233.73
## [1] 221.8396
## [1] 233.7303

Below is a comparison of the training set with the fitted values of the final model, to show how well the model has learned the data: the random series, the logarithm of the actual series and the actual series are each plotted. The brown lines belong to the fitted model and the light blue lines to the actual series.

The fitted values capture the behaviour of the series nicely; however, around the peaks of the original data the model tends to overpredict actual sales. The model can be improved by adding regressors. To decide which regressors to add, the correlation matrix of the different attributes should be examined. Based on the correlations with the logarithm of sold count, the favored_count and category_sold attributes are chosen.

The chosen regressors are added to the model. The new model's AIC and BIC values are much lower, therefore the model is better and we can proceed with it.

## 
## Call:
## arima(x = detrend, order = c(2, 0, 5), xreg = xreg)
## 
## Coefficients:
##          ar1      ar2      ma1      ma2     ma3     ma4     ma5  intercept
##       1.3592  -0.4360  -0.9423  -0.3583  0.1100  0.3248  0.0239    -0.1683
## s.e.  0.1489   0.1385   0.1508   0.1140  0.1135  0.0733  0.0861     0.0066
##       xreg1  xreg2
##           0  1e-04
## s.e.    NaN    NaN
## 
## sigma^2 estimated as 0.06791:  log likelihood = -29.79,  aic = 81.57
## [1] 81.57108
## [1] 125.1705
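Since the report's code chunks are hidden (echo = FALSE), the regressor step can be sketched as follows. This is illustrative, not the actual code: `detrend` and the two regressors below are synthetic stand-ins for the random series, log(favored_count) and category_sold.

```r
# Illustrative only: compare an ARIMA fit with and without external
# regressors via AIC/BIC. The series is simulated so that it really
# depends on the two regressors.
set.seed(1)
n  <- 200
x1 <- rnorm(n)   # stand-in for log(favored_count)
x2 <- rnorm(n)   # stand-in for category_sold
detrend <- as.numeric(arima.sim(list(ar = 0.5), n)) + 0.8 * x1 + 0.3 * x2

fit_plain <- arima(detrend, order = c(1, 0, 1))
fit_xreg  <- arima(detrend, order = c(1, 0, 1), xreg = cbind(x1, x2))

c(AIC(fit_plain), AIC(fit_xreg))   # the xreg fit has the lower AIC here
c(BIC(fit_plain), BIC(fit_xreg))
```

Because the simulated series genuinely depends on the regressors, the xreg fit shows the same kind of AIC/BIC drop as described in the text.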

Predictions are made with the final model. The predicted values for the test set can be seen below, together with a plot of the mean absolute error for each day and the weighted mean absolute percentage error of the prediction.

## Time Series:
## Start = c(1, 1) 
## End = c(1, 7) 
## Frequency = 7 
## [1] 265.7791 203.0056 191.3715 192.5693 209.1476 271.3931 284.5196

## [1] " Weighted Mean Absolute Percentage Error :  15.8553265197858"

Looking at the plot of predictions vs. actual sales, the model captures the behaviour of the data only partly; the fit is not the best. The plot above also shows peaks in the mean absolute errors on June 25 and June 27. Seeing a peak in the error plots of several products on June 25 may indicate an unknown or undefined Trendyol campaign or event; further investigation would be needed to resolve this.

PRODUCT 6 - TrendyolMilla Bikini Top

First, the plot of the data should be examined for seasonality and trend. Missing sold counts are filled with the mean of the data. The variance of the data is clearly very large, so from now on the logarithm of sold count is used. Below, you can see the actual plot and the plot of the log-transformed values. There is a slightly increasing trend, especially at the beginning and end of the plot, and no significant seasonality is visible. To look further, three months of 2021 (March, April and May) are plotted. Again, the seasonality is not significant. In conclusion, it can be said that there is no seasonality.

Before decomposing, the autocorrelation plot of the data can be examined. Below, there is a spike at lag 10, so the frequency is taken as 10, since there was no significant seasonality to suggest otherwise.
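As an illustrative sketch of reading a frequency off the ACF (not the report's code): the lag beyond 0 with the largest autocorrelation is taken as the candidate period. The series below is synthetic, with a built-in period of 10.

```r
# Simulate a series with period 10 plus a little noise, then find the
# lag with the strongest autocorrelation.
set.seed(6)
log_sales <- sin(2 * pi * (1:300) / 10) + rnorm(300, sd = 0.1)
a <- acf(log_sales, lag.max = 30, plot = FALSE)
best_lag <- which.max(a$acf[-1])   # drop lag 0 before searching
best_lag                           # the candidate frequency
```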

Now the data is decomposed to obtain the random series. The “additive” type of decomposition is used because the variance of the data is very low after switching to the logarithm. Plots of the deseasonalized series and the random series, with their autocorrelations, can be seen below.
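A minimal, package-free sketch of the decomposition step, on synthetic data with frequency 10 assumed as above: decompose() returns trend, seasonal and random components, and the random remainder is what the ARIMA models are fitted to.

```r
# Additive decomposition of a synthetic log-sales series at frequency 10.
set.seed(2)
log_sales <- ts(sin(2 * pi * (1:300) / 10) + rnorm(300, sd = 0.2) +
                  (1:300) / 150,          # mild trend
                frequency = 10)
dec    <- decompose(log_sales, type = "additive")
random <- na.omit(dec$random)             # remainder after trend + season
acf(random, plot = FALSE)$acf[2]          # lag-1 autocorrelation left over
```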

For the ARIMA model, (p,d,q) values should be chosen. For this purpose, the ACF and PACF plots can be examined. Looking at the ACF, 3 or 10 can be chosen for ‘q’; looking at the PACF, 5 can be chosen for ‘p’.

The AIC and BIC values of the suggested models can be seen below, along with the output of the auto.arima function. Comparing AIC and BIC values, the (5,0,3) model suggested above is the best among them, so we proceed with this model.

## 
## Call:
## arima(x = detrend, order = c(5, 0, 10))
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ar5      ma1     ma2      ma3
##       0.1038  -0.6085  -0.0238  0.0208  -0.3801  -0.4228  0.2782  -0.4734
## s.e.  0.2650   0.2766   0.2537  0.2385   0.1981   0.2638  0.2787   0.2262
##           ma4     ma5      ma6      ma7     ma8     ma9    ma10  intercept
##       -0.3826  0.0892  -0.1551  -0.1419  0.0249  0.0155  0.1680     -2e-04
## s.e.   0.2512  0.3447   0.1415   0.0947  0.0959  0.0771  0.0874      2e-04
## 
## sigma^2 estimated as 0.1485:  log likelihood = -182.55,  aic = 399.1
## [1] 399.1035
## [1] 466.3087
## 
## Call:
## arima(x = detrend, order = c(5, 0, 3))
## 
## Coefficients:
##          ar1     ar2      ar3     ar4      ar5      ma1     ma2     ma3
##       0.5383  0.2324  -0.3555  0.0084  -0.0890  -0.8667  -0.435  0.3018
## s.e.     NaN     NaN      NaN  0.0741   0.0641      NaN     NaN     NaN
##       intercept
##          -2e-04
## s.e.      2e-04
## 
## sigma^2 estimated as 0.152:  log likelihood = -186.81,  aic = 393.62
## [1] 393.6233
## [1] 433.1557
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)            with non-zero mean : 402.245
##  ARIMA(0,0,0)            with non-zero mean : 511.3063
##  ARIMA(1,0,0)            with non-zero mean : 514.1456
##  ARIMA(0,0,1)            with non-zero mean : 513.2568
##  ARIMA(0,0,0)            with zero mean     : 509.2907
##  ARIMA(1,0,2)            with non-zero mean : Inf
##  ARIMA(2,0,1)            with non-zero mean : 418.6372
##  ARIMA(3,0,2)            with non-zero mean : Inf
##  ARIMA(2,0,3)            with non-zero mean : 403.2674
##  ARIMA(1,0,1)            with non-zero mean : 515.9668
##  ARIMA(1,0,3)            with non-zero mean : Inf
##  ARIMA(3,0,1)            with non-zero mean : 403.9175
##  ARIMA(3,0,3)            with non-zero mean : Inf
##  ARIMA(2,0,2)            with zero mean     : 400.7951
##  ARIMA(1,0,2)            with zero mean     : 419.113
##  ARIMA(2,0,1)            with zero mean     : 417.3323
##  ARIMA(3,0,2)            with zero mean     : Inf
##  ARIMA(2,0,3)            with zero mean     : 401.841
##  ARIMA(1,0,1)            with zero mean     : 513.9288
##  ARIMA(1,0,3)            with zero mean     : Inf
##  ARIMA(3,0,1)            with zero mean     : 402.7315
##  ARIMA(3,0,3)            with zero mean     : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(2,0,2)            with zero mean     : Inf
##  ARIMA(2,0,3)            with zero mean     : Inf
##  ARIMA(2,0,2)            with non-zero mean : Inf
##  ARIMA(3,0,1)            with zero mean     : Inf
##  ARIMA(2,0,3)            with non-zero mean : Inf
##  ARIMA(3,0,1)            with non-zero mean : Inf
##  ARIMA(2,0,1)            with zero mean     : Inf
##  ARIMA(2,0,1)            with non-zero mean : Inf
##  ARIMA(1,0,2)            with zero mean     : Inf
##  ARIMA(0,0,0)            with zero mean     : 509.2907
## 
##  Best model: ARIMA(0,0,0)            with zero mean
## Series: detrend 
## ARIMA(0,0,0) with zero mean 
## 
## sigma^2 estimated as 0.2187:  log likelihood=-253.64
## AIC=509.28   AICc=509.29   BIC=513.23
## [1] 509.2802
## [1] 513.2335
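The order search above can be sketched as comparing candidate (p,0,q) fits by AIC; auto.arima() from the forecast package automates this. The base-R loop below, on simulated data, is a dependency-free stand-in, not the report's code.

```r
# Fit a few candidate orders to a simulated stationary series and
# compare their AICs; the smallest wins.
set.seed(3)
detrend <- arima.sim(list(ar = c(0.5, -0.2), ma = 0.3), n = 300)
candidates <- list(c(2, 0, 1), c(1, 0, 1), c(0, 0, 1))
aics <- sapply(candidates, function(ord) AIC(arima(detrend, order = ord)))
names(aics) <- sapply(candidates, paste, collapse = ",")
sort(aics)   # the order with the smallest AIC is preferred
```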

Below is a comparison of the training set with the fitted values of the final model, to show how well the model has learned the data: the random series, the logarithm of the actual series and the actual series are each plotted. The brown lines belong to the fitted model and the light blue lines to the actual series.

The model can be improved by adding regressors. To decide which regressors to add, the correlation matrix of the different attributes should be examined. Based on the correlations with the logarithm of sold count, only the basket_count attribute is chosen.

The chosen regressor is added to the model. The new model's AIC and BIC values are not much lower; there is only a small difference. Therefore we proceed with the previously chosen model, since there would not be a large difference between them.

## 
## Call:
## arima(x = detrend, order = c(5, 0, 3), xreg = xreg)
## 
## Coefficients:
##          ar1     ar2      ar3     ar4      ar5      ma1      ma2     ma3
##       0.5109  0.4048  -0.4763  0.0248  -0.0671  -0.8330  -0.6177  0.4690
## s.e.  0.2497  0.2829   0.1680  0.0796   0.0767   0.2475   0.3326  0.2223
##       intercept   xreg
##         -0.0112  1e-04
## s.e.        NaN    NaN
## 
## sigma^2 estimated as 0.1517:  log likelihood = -185.03,  aic = 392.06
## [1] 392.062
## [1] 435.5477

Predictions are made with the final model. The predicted values for the test set can be seen below, together with a plot of the mean absolute error for each day and the weighted mean absolute percentage error of the prediction.

## [1] " Weighted Mean Absolute Percentage Error :  64.6525081574837"

Product 7 - Oral-B Rechargeable ToothBrush

Product 7 is the Oral-B rechargeable toothbrush. Since it is a daily-routine product rather than one needed only in particular periods, sales are not expected to increase with the season; however, it is still possible to observe seasonality driven by economic conditions and customer purchase habits.

The data covers approximately one year of sales information, so seasonality at the yearly scale cannot be examined, as only one period is included. Therefore the data is examined at frequencies of 7, 14 and 30 days to see whether day-of-week, fortnightly or day-of-month seasonality is present.

The sales of product 7 over time are plotted below. Box plots and histograms of sales are plotted to see whether the distribution of the data differs across weekdays and months. Moreover, the ACF and PACF are plotted to check for autocorrelation in the data.

From the time plot and the ACF, it can be said that there is a trend in the data. The box plots and histograms show day and month effects: the distribution of the data differs. Finally, the ACF and PACF plots show high autocorrelation at lag 1 and lag 7.

Examination of Attributes

The summary of data is shown below to see nature of data.

##      price         event_date         product_content_id   sold_count    
##  Min.   :110.1   Min.   :2020-05-25   Length:405         Min.   :  0.00  
##  1st Qu.:129.9   1st Qu.:2020-09-03   Class :character   1st Qu.: 20.00  
##  Median :136.3   Median :2020-12-13   Mode  :character   Median : 57.00  
##  Mean   :135.3   Mean   :2020-12-13                      Mean   : 94.91  
##  3rd Qu.:141.6   3rd Qu.:2021-03-24                      3rd Qu.:139.00  
##  Max.   :165.9   Max.   :2021-07-03                      Max.   :513.00  
##  NA's   :9                                                               
##   visit_count    favored_count   basket_count    category_sold 
##  Min.   :    0   Min.   :   0   Min.   :   0.0   Min.   : 321  
##  1st Qu.:    0   1st Qu.:   0   1st Qu.:  92.0   1st Qu.: 610  
##  Median :    0   Median : 175   Median : 240.0   Median : 802  
##  Mean   : 2267   Mean   : 356   Mean   : 399.2   Mean   :1008  
##  3rd Qu.: 4265   3rd Qu.: 588   3rd Qu.: 578.0   3rd Qu.:1099  
##  Max.   :15725   Max.   :2696   Max.   :2249.0   Max.   :5557  
##                                                                
##  category_brand_sold category_visits   ty_visits         category_basket 
##  Min.   :    0       Min.   :  346   Min.   :        1   Min.   :     0  
##  1st Qu.:    0       1st Qu.:  657   1st Qu.:        1   1st Qu.:     0  
##  Median :  693       Median :  880   Median :        1   Median :     0  
##  Mean   : 2991       Mean   : 3896   Mean   : 44737307   Mean   : 18591  
##  3rd Qu.: 5354       3rd Qu.: 1349   3rd Qu.:102143446   3rd Qu.: 41265  
##  Max.   :28944       Max.   :59310   Max.   :178545693   Max.   :281022  
##                                                                          
##  category_favored
##  Min.   : 1242   
##  1st Qu.: 2476   
##  Median : 3298   
##  Mean   : 4202   
##  3rd Qu.: 4869   
##  Max.   :44445   
## 

Some attributes show unrealistic behaviour. More than half of the category_basket values equal zero, which is not possible, since basket_count and category_sold should be included in category_basket, and they are non-zero when category_basket equals zero. The ty_visits attribute always equals one before a particular date, which is unusual and unrealistic: it would mean Trendyol was visited only once per day. visit_count behaves similarly; even though it should be more inclusive, it is sometimes lower than basket_count and sold_count.

By examining the correlation graph and the reliability of the data, “price”, “visit_count”, “basket_count”, “category_basket”, “ty_visits” and “is_campaign” are chosen as regressors.
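The selection step can be sketched with cor() on synthetic stand-ins for the attributes named above (illustrative only, not the report's code):

```r
# Correlations of candidate attributes with sold_count on synthetic data;
# highly correlated (and reliable) columns are kept as regressors.
set.seed(7)
n <- 100
sold_count <- rpois(n, 50)
d <- data.frame(
  sold_count   = sold_count,
  price        = runif(n, 110, 166),
  visit_count  = sold_count * 20 + rpois(n, 100),  # made correlated on purpose
  basket_count = sold_count * 4 + rpois(n, 40)
)
round(cor(d)[, "sold_count"], 2)
```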

Decomposition and Arima Models

When the ARIMA models are constructed, the auto.arima function is used and re-run for every day. Seasonality is set to TRUE, and the frequency is determined as seven by observing the ACF and PACF graphs.
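The daily re-fitting scheme can be sketched as a rolling-origin loop. The report re-runs auto.arima() each day; the sketch below fixes the order and uses simulated data to stay dependency-free.

```r
# Refit on all data up to day t, forecast day t + 1, and slide forward.
set.seed(5)
series    <- as.numeric(arima.sim(list(ar = 0.6), n = 110)) + 100
train_end <- 103                      # last training day
preds <- sapply(train_end:(length(series) - 1), function(t) {
  fit <- arima(series[1:t], order = c(1, 0, 0))
  as.numeric(predict(fit, n.ahead = 1)$pred)
})
length(preds)   # one one-step-ahead forecast per test day
```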

Additive, multiplicative and linear-regression models are used for decomposition to obtain stationary data.

Daily Decomposition

## [1] "the additive model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.0069
## [1] "the multiplicative model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.2127

The additive model gives more stationary data (a lower KPSS statistic), therefore additive decomposition is used in model construction.
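A sketch of the comparison step on synthetic data: both decompositions are computed and their remainders extracted. The printed tests above appear to come from urca::ur.kpss (an assumption); a smaller KPSS statistic means the stationarity null is more comfortably retained.

```r
# Decompose both ways at frequency 7 and extract the two candidate
# remainders; the one with the smaller KPSS statistic would be kept.
set.seed(4)
sales <- ts(50 + 10 * sin(2 * pi * (1:364) / 7) + rnorm(364, sd = 3),
            frequency = 7)
rand_add <- na.omit(decompose(sales, type = "additive")$random)
rand_mul <- na.omit(decompose(sales, type = "multiplicative")$random)
# the actual test would be e.g. urca::ur.kpss(rand_add) (assumed; not run here)
c(additive = sd(rand_add), multiplicative = sd(rand_mul))
```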

## Series: random 
## ARIMA(0,0,1)(0,0,2)[7] with non-zero mean 
## 
## Coefficients:
##          ma1    sma1     sma2     mean
##       0.3383  0.0762  -0.1026  -0.0266
## s.e.  0.0460  0.0509   0.0501   2.1997
## 
## sigma^2 estimated as 1145:  log likelihood=-1969.4
## AIC=3948.79   AICc=3948.94   BIC=3968.73

Fortnight Decomposition

## [1] "the additive model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.0071
## [1] "the multiplicative model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.0196

The data is more stationary than with the daily decomposition, and the additive method is again more stationary.

## Series: random 
## ARIMA(0,0,1)(0,0,1)[14] with non-zero mean 
## 
## Coefficients:
##          ma1     sma1     mean
##       0.3419  -0.1142  -0.0267
## s.e.  0.0462   0.0494   2.0169
## 
## sigma^2 estimated as 1149:  log likelihood=-1970.53
## AIC=3949.05   AICc=3949.16   BIC=3965.01

Decomposition by frequency = 30

## [1] "the additive model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.0179
## [1] "the multiplicative model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.1052

The Additive method gave more stationary data, therefore it is used in model construction.

## Series: random 
## ARIMA(1,0,1)(0,0,2)[30] with zero mean 
## 
## Coefficients:
##          ar1     ma1     sma1     sma2
##       0.5412  0.1878  -0.2182  -0.0726
## s.e.  0.0643  0.0739   0.0580   0.0593
## 
## sigma^2 estimated as 1613:  log likelihood=-1916.29
## AIC=3842.57   AICc=3842.73   BIC=3862.21

The best result obtained is the ARIMA(4,0,0)(0,0,1)[30] model, with the lowest AIC.

Model with Regressors

The regressors determined above are used to improve the model.

## 
## Call:
## arima(x = random, order = c(4, 0, 0), seasonal = c(0, 0, 1), xreg = xreg7)
## 
## Coefficients:
##          ar1     ar2     ar3      ar4    sma1  intercept    price  visit_count
##       0.6079  0.1697  0.0780  -0.0106  0.0739    87.5287  -1.0590      -0.0141
## s.e.  0.0560  0.0627  0.0608   0.0525  0.0514    43.4787   0.3013       0.0023
##       basket_count  category_basket  ty_visits  is_campaign
##             0.2308            2e-04          0       4.9606
## s.e.        0.0119            1e-04        NaN       6.0480
## 
## sigma^2 estimated as 639.8:  log likelihood = -1744.2,  aic = 3514.41

The AIC decreased: the regressors improved the model.

Predictions

##    event_date actual add_arima_forecasted reg_add_arima_forecasted
## 1: 2021-06-26     40            129.52970                 121.7252
## 2: 2021-06-27     46            121.41935                 112.3915
## 3: 2021-06-28     64            123.59512                 116.0881
## 4: 2021-06-29    137            125.11548                 119.9548
## 5: 2021-06-30    131            116.14164                 115.6428
## 6: 2021-07-01    130            100.83371                 102.4004
## 7: 2021-07-02    108             98.68134                 101.2189

Model Evaluation

Judging by WMAPE, the error rates show that the ARIMA model with regressors gives the better results (0.407 vs. 0.442).

##                  model n     mean       sd        CV      FBias      MAPE
## 1 add_arima_forecasted 7 93.71429 42.48417 0.4533372 -0.2428603 0.7599683
##       RMSE    MAD      MADP     WMAPE
## 1 51.48474 41.396 0.4417256 0.4417256
##                      model n     mean       sd        CV      FBias      MAPE
## 1 reg_add_arima_forecasted 7 93.71429 42.48417 0.4533372 -0.2033868 0.6881487
##       RMSE      MAD      MADP     WMAPE
## 1 46.49749 38.14113 0.4069937 0.4069937
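The key columns of the evaluation table can be reproduced under assumed metric definitions (FBias = sum of errors over sum of actuals; WMAPE = sum of absolute errors over sum of actuals). With the forecasts from the prediction table above, the WMAPE of about 0.44 is recovered.

```r
# Assumed metric definitions, checked against the add_arima row above.
eval_forecast <- function(actual, forecast) {
  err <- actual - forecast
  c(n     = length(actual),
    mean  = mean(actual),
    FBias = sum(err) / sum(actual),
    RMSE  = sqrt(mean(err^2)),
    MAD   = mean(abs(err)),
    WMAPE = sum(abs(err)) / sum(actual))
}

actual   <- c(40, 46, 64, 137, 131, 130, 108)                  # from the table
forecast <- c(129.53, 121.42, 123.60, 125.12, 116.14, 100.83, 98.68)
round(eval_forecast(actual, forecast), 4)
```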

Product 8 - Altinyildiz Classics Jacket

The data covers approximately one year of sales information, so seasonality at the yearly scale cannot be examined, as only one period is included. Therefore the data is examined at frequencies of 7, 14 and 30 days to see whether day-of-week, fortnightly or day-of-month seasonality is present.

The sales of product 8 over time are plotted below. Box plots and histograms of sales are plotted to see whether the distribution of the data differs across weekdays and months. Moreover, the ACF and PACF are plotted to check for autocorrelation in the data.

Product 8 is a jacket, and it shows increasing sales in some months; the seasonal temperature can have an effect on it.

It can be seen that sales are zero most of the time; however, there is a huge increase in October.

The ACF and PACF of the data show significant autocorrelation at lag 1, lag 5, lag 7 and lag 20.

Decomposition and Arima Models

When the ARIMA models are constructed, the auto.arima function is used and re-run for every day. Seasonality is set to TRUE, and the frequency is determined as seven by observing the ACF and PACF graphs.

Additive and multiplicative models are used for decomposition to obtain stationary data.

Daily Decomposition

The data is shown below, decomposed by the additive and multiplicative methods.

## [1] "the additive model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.0089
## [1] "the multiplicative model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.069

The additive method gave more stationary data (a lower KPSS statistic), so the model is constructed with the additive decomposition.

## Series: random 
## ARIMA(0,0,0)(0,0,2)[7] with zero mean 
## 
## Coefficients:
##         sma1     sma2
##       0.1981  -0.0968
## s.e.  0.0495   0.0494
## 
## sigma^2 estimated as 6.751:  log likelihood=-946.38
## AIC=1898.76   AICc=1898.82   BIC=1910.72

Fortnight Decomposition

## [1] "the additive model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.0061
## [1] "the multiplicative model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.173

The additive model gave more stationary data, therefore it is used for the model.

## Series: random 
## ARIMA(1,0,1) with zero mean 
## 
## Coefficients:
##           ar1     ma1
##       -0.5945  0.7755
## s.e.   0.1097  0.0851
## 
## sigma^2 estimated as 7.911:  log likelihood=-958.2
## AIC=1922.4   AICc=1922.46   BIC=1934.3

Decomposition by frequency = 30

## [1] "the additive model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.0114
## [1] "the multiplicative model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.2785

The additive model is used for model construction since it is more stationary.

## Series: random 
## ARIMA(0,0,1)(0,0,1)[30] with zero mean 
## 
## Coefficients:
##          ma1     sma1
##       0.2542  -0.1282
## s.e.  0.0510   0.0605
## 
## sigma^2 estimated as 8.243:  log likelihood=-926.9
## AIC=1859.79   AICc=1859.86   BIC=1871.57

ARIMA(0,0,1)(0,0,1)[30] with zero mean gives the best AIC result compared with frequencies 7 and 14, therefore it is used for the predictions and the regressor model.

Examination of Attributes

##      price         event_date         product_content_id   sold_count     
##  Min.   : -1.0   Min.   :2020-05-25   Length:405         Min.   : 0.0000  
##  1st Qu.:350.0   1st Qu.:2020-09-03   Class :character   1st Qu.: 0.0000  
##  Median :600.0   Median :2020-12-13   Mode  :character   Median : 0.0000  
##  Mean   :559.3   Mean   :2020-12-13                      Mean   : 0.9284  
##  3rd Qu.:734.3   3rd Qu.:2021-03-24                      3rd Qu.: 0.0000  
##  Max.   :833.3   Max.   :2021-07-03                      Max.   :52.0000  
##  NA's   :303                                                              
##   visit_count     favored_count     basket_count    category_sold   
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :   0.0  
##  1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.:  16.0  
##  Median :  0.00   Median : 0.000   Median :  0.00   Median :  45.0  
##  Mean   : 27.24   Mean   : 2.242   Mean   :  5.83   Mean   : 200.2  
##  3rd Qu.:  3.00   3rd Qu.: 2.000   3rd Qu.:  5.00   3rd Qu.: 111.0  
##  Max.   :516.00   Max.   :37.000   Max.   :247.00   Max.   :3299.0  
##                                                                     
##  category_brand_sold category_visits    ty_visits         category_basket  
##  Min.   :     0      Min.   :   367   Min.   :        1   Min.   :      0  
##  1st Qu.:     0      1st Qu.:  1432   1st Qu.:        1   1st Qu.:      0  
##  Median :     6      Median :  5324   Median :        1   Median :      0  
##  Mean   : 46247      Mean   : 27767   Mean   : 44737307   Mean   : 353021  
##  3rd Qu.: 94562      3rd Qu.:  9538   3rd Qu.:102143446   3rd Qu.: 464380  
##  Max.   :259590      Max.   :583672   Max.   :178545693   Max.   :3102147  
##                                                                            
##  category_favored     w_day            mon          is_campaign     
##  Min.   :  2324   Min.   :1.000   Min.   : 1.000   Min.   :0.00000  
##  1st Qu.:  8618   1st Qu.:2.000   1st Qu.: 4.000   1st Qu.:0.00000  
##  Median : 24534   Median :4.000   Median : 6.000   Median :0.00000  
##  Mean   : 33688   Mean   :4.007   Mean   : 6.464   Mean   :0.08642  
##  3rd Qu.: 50341   3rd Qu.:6.000   3rd Qu.: 9.000   3rd Qu.:0.00000  
##  Max.   :244883   Max.   :7.000   Max.   :12.000   Max.   :1.00000  
## 

The correlation of price, visit_count and basket_count with sales is high, and it is expected that these variables can be zero when sold_count is zero.

However, it is not expected that category_favored and Trendyol visits (ty_visits) are zero or one; therefore these values are replaced with the mean.
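A minimal sketch of that imputation on a synthetic column (`x` is hypothetical; the assumption is that placeholder values of 0 or 1 are replaced with the mean of the remaining values):

```r
# Replace implausible placeholder values (<= 1) with the mean of the
# valid observations, as described above.
set.seed(8)
x <- c(rep(1, 5), rpois(20, 3000))   # placeholder 1s before the real data
x[x <= 1] <- mean(x[x > 1])          # impute with the mean of valid values
all(x > 1)
```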

Considering correlation and variable reliability, “price”, “visit_count”, “basket_count” and “category_favored” are selected as regressors. The graph of the monthly distribution shows that mon can be an effective factor, therefore it is also added.

Model with Regressors

## 
## Call:
## arima(x = random, order = c(0, 0, 1), seasonal = c(0, 0, 1), xreg = xreg8)
## 
## Coefficients:
##          ma1    sma1  intercept    price  visit_count  basket_count
##       0.3798  0.0389     2.9376  -0.0029      -0.0040        0.1477
## s.e.  0.0308  0.0313     0.7698   0.0013       0.0019        0.0057
##       category_favored      mon
##                      0  -0.1838
## s.e.               NaN   0.0410
## 
## sigma^2 estimated as 3.173:  log likelihood = -748.72,  aic = 1515.45

The AIC got smaller when the regressors were added, improving the model.

Predictions

##    event_date actual add_arima_forecasted reg_add_arima_forecasted
## 1: 2021-06-25      2                    1                        1
## 2: 2021-06-26      1                    1                        1
## 3: 2021-06-27      0                    1                        1
## 4: 2021-06-28      4                    1                        1
## 5: 2021-06-29      1                    2                        2
## 6: 2021-06-30      0                    2                        2
## 7: 2021-07-01      1                    3                        3

Model Evaluation

##                  model n     mean       sd       CV      FBias MAPE     RMSE
## 1 add_arima_forecasted 7 1.285714 1.380131 1.073435 -0.2222222  Inf 1.690309
##        MAD     MADP    WMAPE
## 1 1.428571 1.111111 1.111111
##                      model n     mean       sd       CV      FBias MAPE
## 1 reg_add_arima_forecasted 7 1.285714 1.380131 1.073435 -0.2222222  Inf
##       RMSE      MAD     MADP    WMAPE
## 1 1.690309 1.428571 1.111111 1.111111

Both models give the same WMAPE (1.11), so the regressors did not improve out-of-sample accuracy; the simpler ARIMA model without regressors is preferred.

Product 9 - TrendyolMilla Bikini

The data covers approximately one year of sales information, so seasonality at the yearly scale cannot be examined, as only one period is included. Therefore the data is examined at frequencies of 7, 14 and 30 days to see whether day-of-week, fortnightly or day-of-month seasonality is present.

The sales of product 9 over time are plotted below. Box plots and histograms of sales are plotted to see whether the distribution of the data differs across weekdays and months. Moreover, the ACF and PACF are plotted to check for autocorrelation in the data.

Observing the graph below, the month effect is clearly visible. This is expected, since bikinis are worn in the hot seasons in Turkey. Moreover, examining the ACF and PACF graphs, it can be said that there is a trend in the data and correlation at lag 1 and lag 7.

Examination of Attributes

##      price         event_date         product_content_id   sold_count    
##  Min.   :59.99   Min.   :2020-05-25   Length:405         Min.   :  0.00  
##  1st Qu.:59.99   1st Qu.:2020-09-03   Class :character   1st Qu.:  0.00  
##  Median :59.99   Median :2020-12-13   Mode  :character   Median :  0.00  
##  Mean   :60.11   Mean   :2020-12-13                      Mean   : 18.35  
##  3rd Qu.:59.99   3rd Qu.:2021-03-24                      3rd Qu.:  3.00  
##  Max.   :63.55   Max.   :2021-07-03                      Max.   :286.00  
##  NA's   :281                                                             
##   visit_count    favored_count     basket_count     category_sold 
##  Min.   :    0   Min.   :   0.0   Min.   :   0.00   Min.   :  20  
##  1st Qu.:    0   1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 132  
##  Median :    0   Median :   0.0   Median :   0.00   Median : 563  
##  Mean   : 2457   Mean   : 240.8   Mean   :  88.64   Mean   :1301  
##  3rd Qu.:  589   3rd Qu.: 112.0   3rd Qu.:  19.00   3rd Qu.:1676  
##  Max.   :45833   Max.   :5011.0   Max.   :1735.00   Max.   :8099  
##                                                                   
##  category_brand_sold category_visits     ty_visits         category_basket  
##  Min.   :     0      Min.   :    107   Min.   :        1   Min.   :      0  
##  1st Qu.:     0      1st Qu.:    397   1st Qu.:        1   1st Qu.:      0  
##  Median :  2965      Median :   1362   Median :        1   Median :      0  
##  Mean   : 14028      Mean   :  82604   Mean   : 44737307   Mean   : 118415  
##  3rd Qu.: 15079      3rd Qu.:   2871   3rd Qu.:102143446   3rd Qu.: 101167  
##  Max.   :152168      Max.   :1335060   Max.   :178545693   Max.   :1230833  
##                                                                             
##  category_favored     w_day            mon          is_campaign     
##  Min.   :   628   Min.   :1.000   Min.   : 1.000   Min.   :0.00000  
##  1st Qu.:  2589   1st Qu.:2.000   1st Qu.: 4.000   1st Qu.:0.00000  
##  Median :  7843   Median :4.000   Median : 6.000   Median :0.00000  
##  Mean   : 15287   Mean   :4.007   Mean   : 6.464   Mean   :0.08642  
##  3rd Qu.: 16401   3rd Qu.:6.000   3rd Qu.: 9.000   3rd Qu.:0.00000  
##  Max.   :135551   Max.   :7.000   Max.   :12.000   Max.   :1.00000  
## 

The “price”, “category_sold”, “basket_count” and “category_favored” attributes are more reliable and significantly correlated with the data. Even though visit_count and favored_count are very highly correlated with the data, they are also correlated with basket_count, therefore they are not used as regressors. And “mon” seems to affect sales, judging by the monthly distribution graph.

Decomposition and Arima Models

When the ARIMA models are constructed, the auto.arima function is used and re-run for every day. Seasonality is set to TRUE, and the frequency is determined as seven by observing the ACF and PACF graphs.

Additive, multiplicative and linear-regression models are used for decomposition to obtain stationary data.

Daily Decomposition

## [1] "the additive model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.0082
## [1] "the multiplicative model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.083

The additive method gives more stationary results, therefore it is used for the models.

## Series: random 
## ARIMA(0,0,2)(0,0,2)[7] with zero mean 
## 
## Coefficients:
##          ma1      ma2    sma1    sma2
##       0.0016  -0.2206  0.1231  0.1535
## s.e.  0.0700   0.0800  0.0532  0.0552
## 
## sigma^2 estimated as 103.2:  log likelihood=-1489.39
## AIC=2988.77   AICc=2988.93   BIC=3008.72

Fortnight Decomposition

## [1] "the additive model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.0087
## [1] "the multiplicative model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.2302

The additive model again gives the more stationary result (0.0087 vs. 0.2302) and is used for model construction.

## Series: random 
## ARIMA(1,0,0)(0,0,2)[14] with zero mean 
## 
## Coefficients:
##          ar1    sma1     sma2
##       0.4729  0.0831  -0.1172
## s.e.  0.0445  0.0509   0.0593
## 
## sigma^2 estimated as 158.4:  log likelihood=-1543.96
## AIC=3095.93   AICc=3096.03   BIC=3111.8

Decomposition with Frequency = 30

## [1] "the additive model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.028
## [1] "the multiplicative model"
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.6099

The additive model again gives the more stationary data (0.028 vs. 0.6099), so it is used as the model data.

## Series: random 
## ARIMA(1,0,0)(0,0,2)[30] with zero mean 
## 
## Coefficients:
##          ar1     sma1     sma2
##       0.6941  -0.1337  -0.1976
## s.e.  0.0370   0.0670   0.1009
## 
## sigma^2 estimated as 185.5:  log likelihood=-1511.85
## AIC=3031.71   AICc=3031.81   BIC=3047.41

A manually specified ARIMA(0,0,1)(0,0,1)[30] model with zero mean gives a lower AIC value, therefore it is used for the regression model.
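That AIC comparison can be sketched with `forecast::Arima`, fitting both the auto.arima suggestion and the manual ARIMA(0,0,1)(0,0,1)[30]; `random` is assumed to be the remainder series as a `ts` with frequency 30.

```r
library(forecast)

# Compare the auto.arima pick against the manually chosen orders by AIC.
compare_aic <- function(random) {
  fit_auto   <- Arima(random, order = c(1, 0, 0),
                      seasonal = c(0, 0, 2), include.mean = FALSE)
  fit_manual <- Arima(random, order = c(0, 0, 1),
                      seasonal = c(0, 0, 1), include.mean = FALSE)
  c(auto = AIC(fit_auto), manual = AIC(fit_manual))
}
```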

Model with Regressors

## 
## Call:
## arima(x = random, order = c(0, 0, 1), seasonal = c(0, 0, 1), xreg = xreg8)
## 
## Coefficients:
##          ma1     sma1  intercept    price  visit_count  basket_count
##       0.5569  -0.1643     0.3158  -0.0041       0.0413        0.0201
## s.e.  0.0343   0.0817     6.5663   0.0106       0.0179        0.0489
##       category_favored     mon
##                  0e+00  0.0841
## s.e.             1e-04  0.3520
## 
## sigma^2 estimated as 231.4:  log likelihood = -1553.47,  aic = 3124.94

The AIC value is lower than that of the corresponding ARIMA model without regressors, so it is an improved model.
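The call printed above can be sketched as follows; `xreg8` is a numeric matrix whose columns match the coefficient table, and the column names are assumptions based on the text.

```r
# Regression with ARIMA(0,0,1)(0,0,1) errors via stats::arima.
# `dt` holds the regressors aligned with `random` (both assumed).
fit_with_regressors <- function(random, dt) {
  xreg8 <- as.matrix(dt[, c("price", "visit_count", "basket_count",
                            "category_favored", "mon")])
  # seasonal period is taken from the frequency of `random` (here 30)
  arima(random, order = c(0, 0, 1), seasonal = c(0, 0, 1), xreg = xreg8)
}
```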

Predictions

##    event_date Actual add_arima_forecasted reg_add_arima_forecasted
## 1: 2021-06-25     20             33.04259                 33.47467
## 2: 2021-06-26     27             28.24629                 28.63710
## 3: 2021-06-27     20             31.74235                 32.14315
## 4: 2021-06-28     26             32.17666                 32.60195
## 5: 2021-06-29     19             28.66079                 29.05578
## 6: 2021-06-30     20             25.90133                 26.29765
## 7: 2021-07-01     14             24.36589                 24.74094

Model Evaluation

##                  model n     mean       sd       CV      FBias      MAPE
## 1 add_arima_forecasted 7 20.85714 4.413184 0.211591 -0.3981911 0.4381313
##       RMSE      MAD      MADP     WMAPE
## 1 9.128483 8.305128 0.3981911 0.3981911
##                      model n     mean       sd       CV      FBias      MAPE
## 1 reg_add_arima_forecasted 7 20.85714 4.413184 0.211591 -0.4174742 0.4581128
##       RMSE      MAD      MADP     WMAPE
## 1 9.497635 8.707319 0.4174742 0.4174742

The ARIMA model with no regressors gives the better result on the test data (WMAPE 0.398 vs. 0.417).
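The statistics in the evaluation table follow the usual definitions; a sketch of the evaluation helper (the column names mirror the printed output, and the formulas reproduce the reported values, e.g. FBias and WMAPE both equal 0.398 in magnitude for the first model):

```r
# Accuracy statistics as reported in the evaluation table.
accu <- function(actual, forecasted) {
  error <- actual - forecasted
  data.frame(
    n     = length(actual),
    mean  = mean(actual),
    sd    = sd(actual),
    CV    = sd(actual) / mean(actual),
    FBias = sum(error) / sum(actual),       # negative = over-forecasting
    MAPE  = mean(abs(error) / actual),
    RMSE  = sqrt(mean(error^2)),
    MAD   = mean(abs(error)),
    MADP  = sum(abs(error)) / sum(actual),  # equals WMAPE here
    WMAPE = sum(abs(error)) / sum(actual)
  )
}
```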

CONCLUSION

In order to find the best decomposition level for each product, and with it the best ARIMA model, different decomposition levels were tried and selected. Candidate ARIMA models were then fitted and their performance measured on the test set, which consists of the dates from 24 June 2021 to 30 June 2021. A different model was selected for each product.

Since sales are affected by the overall state of the economy, more external data, such as the dollar exchange rate, could be included for improved accuracy.

Treating each product separately is one of the strong sides of the approach, even though it is a time-consuming task. Comparing the AIC values of the models suggested by auto.arima with the models selected from the ACF and PACF plots, and measuring their performance through predictions on the test data, is another strong side of the models proposed for each product.

Overall, it can be said that the models work fine; the deviation from the real values is not too big.

REFERENCES

Lecture Notes

RMD

The code of my study is available here